Customize your output

Overview

DataPrep lets you customize the output of plot(), plot_missing(), plot_correlation() and create_report(). There are two main settings, display and config.

  1. display is a list of names that controls which Tabs, Sections and Sessions are shown.

  2. config is a dictionary that contains the customizable parameters and designated values.

For convenience, the names passed to display can be copied directly from the Tabs. Displaying less content also saves computation.

For config, a how-to guide helps you manage the frequently used parameters. Click the question mark icon in the upper right corner of any plot: the pop-up lists the customizable parameters for that plot, with a brief description and the default setting of each. Use the Copy All Parameters button to copy the parameters with their default settings as a dictionary, then customize the settings and pass the dictionary to the config argument.
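The copy-then-customize workflow can be sketched as follows. This is a minimal illustration, assuming the copied defaults are the histogram parameters that appear in the pop-up output further down this page; the customized values are arbitrary, and the plot() call is left as a comment so the snippet runs without a dataset.

```python
# Defaults as they would be copied with the "Copy All Parameters" button
# (values taken from the histogram pop-up shown on this page).
copied_defaults = {
    'hist.bins': 50,
    'hist.yscale': 'linear',
    'hist.color': '#aec7e8',
    'height': 400,
    'width': 450,
}

# Customize only the settings you care about; these values are arbitrary.
config = {**copied_defaults, 'hist.bins': 20, 'hist.yscale': 'log'}

# Pass the result to the config argument, e.g.:
# plot(df, 'Pclass', config=config)
print(config)
```

Merging into a new dictionary keeps the copied defaults untouched, so you can reuse them as a starting point for other plots.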

Global and local parameters

There are two types of parameters, global and local.

  1. Local parameters are plot-specific, and the parts of their names are separated by a dot (.). The portion before the first . is the plot name and the portion after it is the parameter name, e.g. bar.bars.

  2. Global parameters apply to all plots that have that parameter. Their names are a single word, e.g. ngroups.

When both a global and a local parameter are given, the local parameter overrides the global one for that specific plot. You can find more details about parameters in parameter_configurations.
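The naming and precedence rules can be illustrated with a small, self-contained sketch. It is not DataPrep's implementation: split_name and effective_value are hypothetical helpers, and 'bar.ngroups' is a made-up local name used only to show a local setting winning over the global one (the real local name in DataPrep may differ).

```python
def split_name(name):
    """Split a parameter name into (plot, parameter); plot is None for global names."""
    plot_name, sep, param = name.partition('.')  # splits at the first dot only
    return (plot_name, param) if sep else (None, name)


def effective_value(config, plot_name, param):
    """Return the value a plot would use: a local setting wins over the global one."""
    local_key = f'{plot_name}.{param}'
    if local_key in config:
        return config[local_key]
    return config.get(param)  # fall back to the global value; None if neither is set


# 'bar.ngroups' is a hypothetical local name, used for illustration only.
config = {'ngroups': 5, 'bar.ngroups': 10}

print(split_name('bar.bars'))                     # ('bar', 'bars')
print(split_name('ngroups'))                      # (None, 'ngroups')
print(effective_value(config, 'bar', 'ngroups'))  # 10 (local overrides global)
print(effective_value(config, 'pie', 'ngroups'))  # 5  (global applies)
```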

Example 1: Choose the Tabs, Sections and Sessions you want

[1]:
from dataprep.eda import plot, create_report
from dataprep.datasets import load_dataset
df = load_dataset('titanic')
plot(df, 'Pclass', display=['Stats', 'Bar Chart', 'Pie Chart'])
[1]:
DataPrep.EDA Report

Overview

Distinct Count: 3
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Memory Size: 13.9 KB
Mean: 2.3086
Minimum: 1
Maximum: 3
Zeros: 0
Zeros (%): 0.0%
Negatives: 0
Negatives (%): 0.0%

Quantile Statistics

Minimum: 1
5-th Percentile: 1
Q1: 2
Median: 3
Q3: 3
95-th Percentile: 3
Maximum: 3
Range: 2
IQR: 1

Descriptive Statistics

Mean: 2.3086
Standard Deviation: 0.8361
Variance: 0.699
Sum: 2057
Skewness: -0.6295
Kurtosis: -1.2796
Coefficient of Variation: 0.3621
[2]:
create_report(df, display=["Overview", "Interactions"])
[2]:
DataPrep Report

Overview

Dataset Statistics

Number of Variables 12
Number of Rows 891
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 646.0 KB
Average Row Size in Memory 742.4 B

Variable Types

Categorical 12

Interactions

[3]:
plot(df, display=["Stats", "Insights"])
[3]:
DataPrep.EDA Report

Dataset Statistics

Number of Variables 12
Number of Rows 891
Missing Cells 177
Missing Cells (%) 1.7%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 333.9 KB
Average Row Size in Memory 383.7 B
Variable Types
  • Numerical: 7
  • Categorical: 5

Dataset Insights

  • PassengerId is uniformly distributed [Uniform]
  • Age has 177 (19.87%) missing values [Missing]
  • Survived is skewed [Skewed]
  • Pclass is skewed [Skewed]
  • SibSp is skewed [Skewed]
  • Parch is skewed [Skewed]
  • Fare is skewed [Skewed]
  • Age has 177 (19.87%) infinite values [Infinity]
  • Name has a high cardinality: 891 distinct values [High Cardinality]
  • Ticket has a high cardinality: 681 distinct values [High Cardinality]
  • Cabin has a high cardinality: 148 distinct values [High Cardinality]
  • Name has all distinct values [Unique]
  • Survived has 549 (61.62%) zeros [Zeros]
  • SibSp has 608 (68.24%) zeros [Zeros]
  • Parch has 678 (76.09%) zeros [Zeros]

Example 2: Customize your plot

[4]:
plot(df, "Pclass", config={'bar.bars': 10, 'bar.sort_descending': True, 'bar.yscale': 'linear', 'height': 400, 'width': 450})
[4]:
DataPrep.EDA Report

Overview

Distinct Count: 3
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Memory Size: 13.9 KB
Mean: 2.3086
Minimum: 1
Maximum: 3
Zeros: 0
Zeros (%): 0.0%
Negatives: 0
Negatives (%): 0.0%

Quantile Statistics

Minimum: 1
5-th Percentile: 1
Q1: 2
Median: 3
Q3: 3
95-th Percentile: 3
Maximum: 3
Range: 2
IQR: 1

Descriptive Statistics

Mean: 2.3086
Standard Deviation: 0.8361
Variance: 0.699
Sum: 2057
Skewness: -0.6295
Kurtosis: -1.2796
Coefficient of Variation: 0.3621

How-to guide parameters and their defaults, as shown in the pop-up of each plot:

hist parameters
  • 'hist.bins': 50 (number of bins in the histogram)
  • 'hist.yscale': 'linear' (Y-axis scale, "linear" or "log")
  • 'hist.color': '#aec7e8' (color)
  • 'height': 400 (height of the plot)
  • 'width': 450 (width of the plot)

Insight: Pclass is skewed left (γ1 = -0.6295)

kde parameters
  • 'kde.bins': 50 (number of bins in the histogram)
  • 'kde.yscale': 'linear' (Y-axis scale, "linear" or "log")
  • 'kde.hist_color': '#aec7e8' (color of the density histogram)
  • 'kde.line_color': '#d62728' (color of the density line)
  • 'height': 400 (height of the plot)
  • 'width': 450 (width of the plot)

qqnorm parameters
  • 'qqnorm.point_color': '#1f77b4' (color of the points)
  • 'qqnorm.line_color': '#d62728' (color of the line)
  • 'height': 400 (height of the plot)
  • 'width': 450 (width of the plot)

Insight: Pclass is not normally distributed (p-value 7.13662382541816e-20)

box parameters
  • 'box.color': '#1f77b4' (color)
  • 'height': 400 (height of the plot)
  • 'width': 450 (width of the plot)

Example 3: Customize your Insights

[5]:
plot(df, config={'insight.missing.threshold': 20, 'insight.duplicates.threshold': 20})
[5]:
DataPrep.EDA Report

Dataset Statistics

Number of Variables 12
Number of Rows 891
Missing Cells 177
Missing Cells (%) 1.7%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 333.9 KB
Average Row Size in Memory 383.7 B
Variable Types
  • Numerical: 7
  • Categorical: 5

Dataset Insights

  • PassengerId is uniformly distributed [Uniform]
  • Survived is skewed [Skewed]
  • Pclass is skewed [Skewed]
  • SibSp is skewed [Skewed]
  • Parch is skewed [Skewed]
  • Fare is skewed [Skewed]
  • Age has 177 (19.87%) infinite values [Infinity]
  • Name has a high cardinality: 891 distinct values [High Cardinality]
  • Ticket has a high cardinality: 681 distinct values [High Cardinality]
  • Cabin has a high cardinality: 148 distinct values [High Cardinality]
  • Name has all distinct values [Unique]
  • Survived has 549 (61.62%) zeros [Zeros]
  • SibSp has 608 (68.24%) zeros [Zeros]
  • Parch has 678 (76.09%) zeros [Zeros]
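
With 'insight.missing.threshold' set to 20, the "Age has 177 (19.87%) missing values" insight from Example 1 no longer appears, because 19.87% is below the threshold. The effect can be sketched as below. This is only an illustration of a percentage threshold, not DataPrep's code: missing_insight is a hypothetical helper, the inclusive comparison is an assumption, and the low threshold of 1 is an arbitrary value chosen for contrast.

```python
def missing_insight(column, n_missing, n_rows, threshold):
    """Return an insight string when the missing percentage reaches the threshold."""
    percent = 100 * n_missing / n_rows
    if percent >= threshold:  # assumption: the comparison is inclusive
        return f'{column} has {n_missing} ({percent:.2f}%) missing values'
    return None


# Age has 177 missing values out of 891 rows.
print(missing_insight('Age', 177, 891, threshold=1))   # arbitrary low threshold: insight reported
print(missing_insight('Age', 177, 891, threshold=20))  # None: 19.87% is below 20
```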